numpy - 1.25.0
pandas - 2.0.2
scikit-learn - 1.2.2
scipy - 1.10.1
seaborn - 0.12.2
matplotlib - 3.7.1
statsmodels - 0.14.0
import warnings
import pandas as pd
import numpy as np
import scipy.stats as st
import matplotlib.pyplot as plt
import seaborn as sns
from scipy.stats import zscore, randint, reciprocal, uniform
%matplotlib inline
import statsmodels.formula.api as smf
from imblearn.over_sampling import SMOTE
from sklearn.linear_model import LogisticRegression, LinearRegression
from sklearn.preprocessing import PolynomialFeatures, StandardScaler, LabelEncoder
from sklearn.model_selection import train_test_split, cross_val_score, GridSearchCV, RandomizedSearchCV
from sklearn import metrics
from sklearn.metrics import accuracy_score, recall_score, roc_curve, auc, precision_score, f1_score, confusion_matrix, roc_auc_score, classification_report
from sklearn.neighbors import KNeighborsClassifier
from sklearn.svm import SVC
DOMAIN: Medical.
CONTEXT: Medical research university X is conducting deep research on patients with certain conditions. The university has an internal AI team. Due to confidentiality, the client has masked the patients' details and conditions, providing separate datasets to the AI team for developing an AI/ML model that can predict a patient's condition from the received test results.
DATA DESCRIPTION: The data consists of biomechanics features of the patients according to their current conditions. Each patient is represented in the dataset by six biomechanics attributes derived from the shape and orientation of the condition relative to the corresponding body part.
DATA DICTIONARY:
PROJECT OBJECTIVE: To demonstrate the ability to fetch, process, and leverage data to generate useful predictions by training supervised learning algorithms.
df1=pd.read_csv('Normal.csv')
df2=pd.read_csv('Type_H.csv')
df3=pd.read_csv('Type_S.csv')
# Print shape of each DataFrame
print("Shape of df1:", df1.shape)
print("Shape of df2:", df2.shape)
print("Shape of df3:", df3.shape)
# Print columns of each DataFrame
print("Columns of df1:", df1.columns)
print("Columns of df2:", df2.columns)
print("Columns of df3:", df3.columns)
Shape of df1: (100, 7)
Shape of df2: (60, 7)
Shape of df3: (150, 7)
Columns of df1: Index(['P_incidence', 'P_tilt', 'L_angle', 'S_slope', 'P_radius', 'S_Degree',
'Class'],
dtype='object')
Columns of df2: Index(['P_incidence', 'P_tilt', 'L_angle', 'S_slope', 'P_radius', 'S_Degree',
'Class'],
dtype='object')
Columns of df3: Index(['P_incidence', 'P_tilt', 'L_angle', 'S_slope', 'P_radius', 'S_Degree',
'Class'],
dtype='object')
column_names_df1 = df1.columns
column_names_df2 = df2.columns
column_names_df3 = df3.columns
column_names_equal = all(column_names_df1 == column_names_df2) and all(column_names_df2 == column_names_df3)
if column_names_equal:
    print("All DataFrames have the same column names.")
else:
    print("DataFrames have different column names.")
All DataFrames have the same column names.
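As a side note, pandas also offers `Index.equals`, which compares labels, order, and length in one call. A minimal sketch on hypothetical empty frames (df1/df2/df3 here are stand-ins, not the real data):

```python
import pandas as pd

# Hypothetical frames with identical columns
df1 = pd.DataFrame(columns=['A', 'B'])
df2 = pd.DataFrame(columns=['A', 'B'])
df3 = pd.DataFrame(columns=['A', 'B'])

# Index.equals compares labels, order, and length in one call
same = df1.columns.equals(df2.columns) and df2.columns.equals(df3.columns)
print(same)
```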
print("Data Types of df1:", df1.dtypes)
print()
print("Data Types of df2:", df2.dtypes)
print()
print("Data Types of df3:", df3.dtypes)
Data Types of df1:
P_incidence    float64
P_tilt         float64
L_angle        float64
S_slope        float64
P_radius       float64
S_Degree       float64
Class           object
dtype: object

Data Types of df2:
P_incidence    float64
P_tilt         float64
L_angle        float64
S_slope        float64
P_radius       float64
S_Degree       float64
Class           object
dtype: object

Data Types of df3:
P_incidence    float64
P_tilt         float64
L_angle        float64
S_slope        float64
P_radius       float64
S_Degree       float64
Class           object
dtype: object
P_incidence, P_tilt, L_angle, S_slope, P_radius, and S_Degree have data type float64, i.e. numerical values with decimal precision. Class has data type object, which typically represents string or categorical data; here it holds the categorical class labels.
class_var_df1 = df1['Class'].value_counts()
class_var_df2 = df2['Class'].value_counts()
class_var_df3 = df3['Class'].value_counts()
print("Class variation in df1:")
print(class_var_df1)
print("\nClass variation in df2:")
print(class_var_df2)
print("\nClass variation in df3:")
print(class_var_df3,"\n")
Class variation in df1:
Class
Normal    73
Nrmal     27
Name: count, dtype: int64

Class variation in df2:
Class
Type_H    37
type_h    23
Name: count, dtype: int64

Class variation in df3:
Class
Type_S    133
tp_s       17
Name: count, dtype: int64
# Identifying unique variations in 'Class' for each DataFrame
normal_var = df1['Class'].unique()
type_h_var = df2['Class'].unique()
type_s_var = df3['Class'].unique()
print(normal_var)
print()
print(type_h_var)
print()
print(type_s_var)
['Normal' 'Nrmal']

['Type_H' 'type_h']

['Type_S' 'tp_s']
# Defining a mapping to unify the variations
mapping = {
    'Normal': 'normal',
    'Nrmal': 'normal',
}
mapping2 = {
    'Type_H': 'type_h'
}
mapping3 = {
    'Type_S': 'type_s',
    'tp_s': 'type_s'
}
# Applying the mapping to update 'Class' values in each DataFrame
df1['Class'] = df1['Class'].map(mapping)
df2['Class'] = df2['Class'].map(mapping2)
df3['Class'] = df3['Class'].map(mapping3)
We have now unified all the variations in the 'Class' feature across all three DataFrames.
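A normalization based on `str.lower()` plus `replace()` is more robust than `map()`, since `replace()` leaves values absent from the dictionary untouched instead of turning them into NaN. A minimal sketch on hypothetical labels (the Series below is made up, not read from the files):

```python
import pandas as pd

# Hypothetical labels with the same kinds of variations seen above
s = pd.Series(['Normal', 'Nrmal', 'Type_H', 'type_h', 'Type_S', 'tp_s'])

# Lower-case first, then fix the remaining misspellings with replace();
# unlike map(), replace() leaves values absent from the dict untouched,
# so no NaNs appear for labels that are already correct
cleaned = s.str.lower().replace({'nrmal': 'normal', 'tp_s': 'type_s'})
print(cleaned.tolist())
```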
# Combining or concatenating the dataframes row-wise
df_comb = pd.concat([df1, df2, df3], ignore_index=True)
#Reset the index of the combined DataFrame
df_comb.reset_index(drop=True, inplace=True)
# checking the shape of the data
df_comb.shape
(310, 7)
The DataFrames have been combined correctly: 100 + 60 + 150 = 310 rows.
df_comb.sample(n=5)
| | P_incidence | P_tilt | L_angle | S_slope | P_radius | S_Degree | Class |
|---|---|---|---|---|---|---|---|
| 222 | 80.074914 | 48.069531 | 52.403439 | 32.005383 | 110.709912 | 67.727316 | type_s |
| 294 | 72.643850 | 18.929117 | 68.000000 | 53.714733 | 116.963416 | 25.384247 | type_s |
| 0 | 38.505273 | 16.964297 | 35.112814 | 21.540976 | 127.632875 | 7.986683 | normal |
| 92 | 54.600316 | 21.488974 | 29.360216 | 33.111342 | 118.343321 | -1.471067 | normal |
| 203 | 49.782121 | 6.466805 | 53.000000 | 43.315316 | 110.864783 | 25.335647 | type_s |
null_percentage = (df_comb.isnull().sum() / len(df_comb)) * 100
# Print the feature-wise percentage of null values
print(null_percentage)
P_incidence    0.000000
P_tilt         0.000000
L_angle        0.000000
S_slope        0.000000
P_radius       0.000000
S_Degree       0.000000
Class          7.419355
dtype: float64
We notice that approximately 7.42% (23 rows) of the values in the 'Class' column are missing. These NaNs correspond exactly to df2's pre-existing 'type_h' labels: `Series.map()` replaces any value absent from the mapping dictionary with NaN, and `mapping2` does not map 'type_h' to itself.
df_comb.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| P_incidence | 310.0 | 60.496653 | 17.236520 | 26.147921 | 46.430294 | 58.691038 | 72.877696 | 129.834041 |
| P_tilt | 310.0 | 17.542822 | 10.008330 | -6.554948 | 10.667069 | 16.357689 | 22.120395 | 49.431864 |
| L_angle | 310.0 | 51.930930 | 18.554064 | 14.000000 | 37.000000 | 49.562398 | 63.000000 | 125.742385 |
| S_slope | 310.0 | 42.953831 | 13.423102 | 13.366931 | 33.347122 | 42.404912 | 52.695888 | 121.429566 |
| P_radius | 310.0 | 117.920655 | 13.317377 | 70.082575 | 110.709196 | 118.268178 | 125.467674 | 163.071041 |
| S_Degree | 310.0 | 26.296694 | 37.559027 | -11.058179 | 1.603727 | 11.767934 | 41.287352 | 418.543082 |
From the table above, no values except those in the 'S_Degree' column appear extremely unusual or anomalous. The statistics (mean, standard deviation, min, 25th/50th/75th percentiles, and max) for 'P_incidence', 'P_tilt', 'L_angle', 'S_slope', and 'P_radius' fall within reasonable ranges.
For 'S_Degree', the data is highly variable, as indicated by the large standard deviation, and outliers are likely present: the maximum value (418.54) is far above the 75th percentile (41.29).
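One common way to quantify "significantly higher than Q3" is Tukey's 1.5 × IQR rule. A minimal sketch on illustrative values echoing S_Degree's shape (not the real column):

```python
import pandas as pd

# Illustrative values echoing S_Degree's shape (not the real column)
s = pd.Series([1.6, 5.0, 11.8, 20.0, 41.3, 55.0, 418.5])

# Tukey's rule: flag points beyond 1.5 * IQR from the quartiles
q1, q3 = s.quantile(0.25), s.quantile(0.75)
iqr = q3 - q1
outliers = s[(s < q1 - 1.5 * iqr) | (s > q3 + 1.5 * iqr)]
print(outliers.tolist())
```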
sns.heatmap(df_comb.select_dtypes(include=['float64']).corr(), vmin=-1, vmax=1, annot=True, cmap='RdPu');
From the above Heatmap, we can observe the following insights:
- Positive correlation: P_incidence, P_tilt, L_angle, and S_slope are positively correlated with one another; as one increases, the others tend to increase. The strongest positive correlation is between S_slope and P_incidence (0.815).
- Negative correlation: P_radius is negatively correlated with P_incidence (-0.247); as P_radius decreases, P_incidence tends to increase, though the relationship is relatively weak.
- Weak correlation: P_radius and S_Degree are essentially uncorrelated (-0.026), implying little to no linear relationship.
- Moderate correlation: P_tilt and L_angle have a moderate positive correlation (0.433).
warnings.filterwarnings("ignore")
# Creating the pairplot
sns.pairplot(df_comb, hue='Class', palette="Set1")
plt.show()
It is observed that the 'type_s' class is noticeably more prevalent in the DataFrame than the 'type_h' and 'normal' classes.
sns.jointplot(data=df_comb, x='P_incidence', y='S_slope', kind='reg');
There appears to be a strong positive correlation between 'P_incidence' and 'S_slope': as one variable increases, the other also tends to increase.
# Calculate skewness scores for numeric columns
num_columns = df_comb.select_dtypes(include=[float, int]).columns
skew_scores = df_comb[num_columns].skew()
# Print skewness scores for numeric columns
print("Skewness Scores:")
print(skew_scores)
plt.figure(figsize=(9, 6))
sns.boxplot(data=df_comb[num_columns])
plt.xticks(rotation=90);
Skewness Scores:
P_incidence    0.520440
P_tilt         0.676553
L_angle        0.599451
S_slope        0.792577
P_radius      -0.176835
S_Degree       4.317954
dtype: float64
All the numerical features in the dataset contain outliers, and every column is right (positively) skewed except 'P_radius', which is left (negatively) skewed.
Based on the skewness scores:
- P_incidence (0.520): slight right skew; most data points concentrate at lower values, with a few higher values forming a longer right tail.
- P_tilt (0.677): moderate right skew, with a longer right tail from a few higher values.
- L_angle (0.599): moderate right skew, again with a longer right tail.
- S_slope (0.793): a more pronounced right skew, driven by a few extreme higher values.
- P_radius (-0.177): slight left skew; the distribution concentrates at higher values, with a few lower values forming a longer left tail.
- S_Degree (4.318): severe right skew, with an extremely long right tail caused by a few very high values.

Let's drop the row with the maximum value of 'S_Degree', an extreme outlier, since such extreme data points can disproportionately influence the model.
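Dropping the extreme point is one option; for heavy right skew, a log transform is another common remedy. A minimal sketch on illustrative values (not the real S_Degree column):

```python
import numpy as np
import pandas as pd

# Heavily right-skewed toy data (stand-in for the shape of S_Degree)
s = pd.Series([1, 2, 3, 5, 8, 13, 400], dtype=float)

print(round(s.skew(), 2))            # strongly positive skew
print(round(np.log1p(s).skew(), 2))  # skew is reduced after the log transform
```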
# Find the index of the row with the maximum value in 'S_Degree'
row_to_drop = df_comb['S_Degree'].idxmax()
# Drop the row
df_comb.drop(row_to_drop, axis=0, inplace=True)
sns.boxplot(data=df_comb, y='S_Degree');
df_comb['S_Degree'].describe()
count    309.000000
mean      25.027289
std       30.234211
min      -11.058179
25%        1.594748
50%       11.463223
75%       40.880923
max      148.753711
Name: S_Degree, dtype: float64
The statistics (mean, standard deviation, min, 25th/50th/75th percentiles, and max) for 'S_Degree' are now within a reasonable range.
Now, before splitting the data into X and y, let's check whether the dataset has any missing values or duplicated entries.
# Check for the duplicated entries
df_comb.duplicated().sum()
0
Therefore there are no duplicate entries.
# Check for the missing values
pd.DataFrame({
    'Count': df_comb.isnull().sum()[df_comb.isnull().sum() > 0],
    'Percentage': (df_comb.isnull().sum()[df_comb.isnull().sum() > 0] / df_comb.shape[0]) * 100
})
| | Count | Percentage |
|---|---|---|
| Class | 23 | 7.443366 |
Let's deal with the missing values by removing the rows with missing target values. Here's why:
- Impact on class distribution: imputing missing values in the target variable can introduce bias and distort the class distribution, especially if they are imputed with the majority class. Removing the affected rows preserves the original class distribution.
- Preserving data integrity: imputing a target variable is challenging, particularly for imbalanced datasets; since the target drives model training and evaluation, imputation would introduce uncertainty and compromise the integrity of the data.
- Sample representativeness: removing the rows retains a sample that accurately reflects the available data, ensuring the model is trained on reliable, unbiased labels.
df_comb = df_comb.dropna(subset=['Class'])
df_comb.isnull().sum()
P_incidence    0
P_tilt         0
L_angle        0
S_slope        0
P_radius       0
S_Degree       0
Class          0
dtype: int64
df_comb.shape
(286, 7)
All rows with missing target values have been dropped, and the dataset is now ready for the split.
# Converting data type of Class to Category
df_comb['Class'] = df_comb.Class.astype('category')
# Splitting data into x and y
X = df_comb.drop(labels= "Class" , axis = 1)
y = df_comb["Class"]
X.head()
| | P_incidence | P_tilt | L_angle | S_slope | P_radius | S_Degree |
|---|---|---|---|---|---|---|
| 0 | 38.505273 | 16.964297 | 35.112814 | 21.540976 | 127.632875 | 7.986683 |
| 1 | 54.920858 | 18.968430 | 51.601455 | 35.952428 | 125.846646 | 2.001642 |
| 2 | 44.362490 | 8.945435 | 46.902096 | 35.417055 | 129.220682 | 4.994195 |
| 3 | 48.318931 | 17.452121 | 48.000000 | 30.866809 | 128.980308 | -0.910941 |
| 4 | 45.701789 | 10.659859 | 42.577846 | 35.041929 | 130.178314 | -3.388910 |
Before splitting the data into train and test sets, we first convert the features to z-scores, since we do not know what units/scales were used, and store them in a new DataFrame.
X_scaled = X.apply(zscore) # convert all attributes to Z scale
X_scaled.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| P_incidence | 286.0 | 9.937661e-17 | 1.001753 | -2.102482 | -0.810374 | -0.037323 | 0.776786 | 3.397698 |
| P_tilt | 286.0 | 1.739091e-16 | 1.001753 | -2.374309 | -0.698326 | -0.151715 | 0.443510 | 3.163122 |
| L_angle | 286.0 | 2.981298e-16 | 1.001753 | -2.141048 | -0.745539 | -0.113212 | 0.580879 | 3.942045 |
| S_slope | 286.0 | 2.484415e-16 | 1.001753 | -2.478113 | -0.770592 | -0.046622 | 0.750572 | 2.911626 |
| P_radius | 286.0 | 1.440961e-15 | 1.001753 | -3.509007 | -0.531486 | 0.040173 | 0.566407 | 3.308732 |
| S_Degree | 286.0 | 0.000000e+00 | 1.001753 | -1.236573 | -0.820846 | -0.199908 | 0.523066 | 3.978153 |
The data has therefore been scaled: every feature now has a mean of approximately 0 and a standard deviation of approximately 1.
In the train_test_split function of scikit-learn, the stratify parameter is an optional argument that allows you to perform stratified sampling during the train-test split.
Stratified sampling is a sampling technique where the class distribution in the dataset is preserved in the train and test splits. This is particularly useful when dealing with imbalanced datasets, where one class may have significantly fewer samples than others. It ensures that the train and test sets have similar class distributions, which can help in achieving more reliable model evaluation and performance estimation.
Note that it is important to have a sufficient number of samples in each class for stratified sampling to work effectively. If any class has too few samples, you may encounter warnings or errors during the split process.
By using the stratify parameter, you can ensure that your train and test sets maintain the class distribution, which is beneficial for evaluating and training models on imbalanced datasets or when class proportions are important in your analysis.
For example:
X_train, X_test, Y_train, Y_test = train_test_split(X_scaled, y, test_size=0.20, random_state=1, stratify=y)
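To see stratification in action, one can compare class proportions in the resulting splits. A small self-contained sketch on hypothetical imbalanced labels (X_demo and y_demo are made up, not the project data):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split

# Hypothetical imbalanced labels: 80 of class 'a', 20 of class 'b'
y_demo = pd.Series(['a'] * 80 + ['b'] * 20)
X_demo = pd.DataFrame({'x': np.arange(100)})

_, _, y_tr, y_te = train_test_split(X_demo, y_demo, test_size=0.2,
                                    random_state=1, stratify=y_demo)

# Both splits preserve the original 80/20 class proportions
print(y_tr.value_counts(normalize=True).to_dict())
print(y_te.value_counts(normalize=True).to_dict())
```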
# Split X and y into training and test set in 80:20 ratio
X_train, X_test, Y_train, Y_test = train_test_split(X_scaled, y, test_size=0.20, random_state=1)
X_train.head()
| P_incidence | P_tilt | L_angle | S_slope | P_radius | S_Degree | |
|---|---|---|---|---|---|---|
| 39 | -0.631409 | 0.275383 | -0.888962 | -1.084422 | 0.322667 | -0.801066 |
| 181 | 0.758766 | 0.363147 | 0.220469 | 0.732919 | 0.165946 | 1.557025 |
| 38 | -0.836695 | -0.877997 | -0.967339 | -0.415848 | 0.193147 | -0.822531 |
| 105 | -1.259354 | -0.349025 | -1.535422 | -1.424890 | 0.908065 | -0.802953 |
| 309 | -0.780479 | -0.102200 | -0.925482 | -0.976817 | -1.690735 | 0.049130 |
We have split the data into train and test set.
knn = KNeighborsClassifier(n_neighbors=5)  # Default metric is 'minkowski' with p=2 (equivalent to Euclidean); default weights is 'uniform'
knn.fit(X_train, Y_train)
KNeighborsClassifier()
# Make predictions on the test data
y_pred = knn.predict(X_test)
We have trained the K-NN classifier and generated predictions on the test data.
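Under the hood, K-NN stores the training set and, for each query, takes a majority vote among the k nearest points. A minimal sketch of that vote on hypothetical 2-D points (X_tr, y_tr, and query below are made up) using the class labels from this project:

```python
import numpy as np
from collections import Counter

# Hypothetical 2-D training points with this project's class labels
X_tr = np.array([[0.0, 0.0], [0.1, 0.0], [0.2, 0.1], [5.0, 5.0], [5.1, 4.9]])
y_tr = np.array(['normal', 'normal', 'type_h', 'type_s', 'type_s'])
query = np.array([0.05, 0.05])

# Euclidean distance from the query to every training point,
# then a majority vote among the k nearest labels
d = np.linalg.norm(X_tr - query, axis=1)
k = 3
nearest = y_tr[np.argsort(d)[:k]]
print(Counter(nearest).most_common(1)[0][0])  # -> normal
```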
train_pred = knn.predict(X_train)
precision_t = precision_score(Y_train, train_pred, average='macro')
recall_t = recall_score(Y_train, train_pred, average='macro')
f1_t = f1_score(Y_train, train_pred, average='macro')
accuracy_t = accuracy_score(Y_train, train_pred)
# Print the performance metrics
print("Performance Metrics: ")
print("Precision:", precision_t)
print("Recall:", recall_t)
print("F1-Score:", f1_t)
print("Accuracy:", accuracy_t)
print("Score:", knn.score(X_train, Y_train))
print()
print("Confusion Matrix:")
cm = metrics.confusion_matrix(Y_train, train_pred, labels=['normal', 'type_h', 'type_s'])
df_cm = pd.DataFrame(cm, index=['normal', 'type_h', 'type_s'],
                     columns=['Predicted normal', 'Predicted type_h', 'Predicted type_s'])
plt.figure(figsize = (8,6))
sns.heatmap(df_cm, annot=True ,fmt='g');
Performance Metrics: 
Precision: 0.8645231071779745
Recall: 0.7690940244163632
F1-Score: 0.7924438896978944
Accuracy: 0.868421052631579
Score: 0.868421052631579

Confusion Matrix:
precision = precision_score(Y_test, y_pred, average='macro')
recall = recall_score(Y_test, y_pred, average='macro')
f1 = f1_score(Y_test, y_pred, average='macro')
accuracy = accuracy_score(Y_test, y_pred)
# Print the performance metrics
print("Performance Metrics: ")
print("Precision:", precision)
print("Recall:", recall)
print("F1-Score:", f1)
print("Accuracy:", accuracy)
print("Score:", knn.score(X_test, Y_test))
print()
print("Confusion Matrix:")
cm = metrics.confusion_matrix(Y_test, y_pred, labels=['normal', 'type_h', 'type_s'])
df_cm = pd.DataFrame(cm, index=['normal', 'type_h', 'type_s'],
                     columns=['Predicted normal', 'Predicted type_h', 'Predicted type_s'])
plt.figure(figsize = (8,6))
sns.heatmap(df_cm, annot=True ,fmt='g');
print()
Performance Metrics: 
Precision: 0.8452380952380952
Recall: 0.636029411764706
F1-Score: 0.6225940225940225
Accuracy: 0.8103448275862069
Score: 0.8103448275862069

Confusion Matrix:
knn = KNeighborsClassifier(n_neighbors=5, metric='manhattan')
knn.fit(X_train, Y_train)
KNeighborsClassifier(metric='manhattan')
# Make predictions on the test data
y_pred = knn.predict(X_test)
precision_2 = precision_score(Y_test, y_pred, average='macro')
recall_2 = recall_score(Y_test, y_pred, average='macro')
f1_2 = f1_score(Y_test, y_pred, average='macro')
accuracy_2 = accuracy_score(Y_test, y_pred)
# Print the performance metrics
print("Performance Metrics: ")
print("Precision:", precision_2)
print("Recall:", recall_2)
print("F1-Score:", f1_2)
print("Accuracy:", accuracy_2)
print("Score:", knn.score(X_test, Y_test))
print()
# Changes in performance metrics with respect to base model
precision_change = (precision_2 - precision) * 100
recall_change = (recall_2 - recall) * 100
f1_change = (f1_2 - f1) * 100
accuracy_change = (accuracy_2 - accuracy) * 100
# Print the changes in performance
print("Change in Performance metrics:")
print(f"Precision: {precision_change:.2f}%")
print(f"Recall: {recall_change:.2f}%")
print(f"F1-Score: {f1_change:.2f}%")
print(f"Accuracy: {accuracy_change:.2f}%",'\n')
Performance Metrics: 
Precision: 0.835064935064935
Recall: 0.6262254901960784
F1-Score: 0.6113229934740233
Accuracy: 0.7931034482758621
Score: 0.7931034482758621

Change in Performance metrics:
Precision: -1.02%
Recall: -0.98%
F1-Score: -1.13%
Accuracy: -1.72%
knn = KNeighborsClassifier(n_neighbors=5, metric='chebyshev')
knn.fit(X_train, Y_train)
y_pred = knn.predict(X_test)
precision_3 = precision_score(Y_test, y_pred, average='macro')
recall_3 = recall_score(Y_test, y_pred, average='macro')
f1_3 = f1_score(Y_test, y_pred, average='macro')
accuracy_3 = accuracy_score(Y_test, y_pred)
# Print the performance metrics
print("Performance Metrics: ")
print("Precision:", precision_3)
print("Recall:", recall_3)
print("F1-Score:", f1_3)
print("Accuracy:", accuracy_3)
print("Score:", knn.score(X_test, Y_test))
print()
# Changes in performance metrics with respect to base model
precision_change = (precision_3 - precision) * 100
recall_change = (recall_3 - recall) * 100
f1_change = (f1_3 - f1) * 100
accuracy_change = (accuracy_3 - accuracy) * 100
# Print the changes in performance
print("Change in Performance metrics:")
print(f"Precision: {precision_change:.2f}%")
print(f"Recall: {recall_change:.2f}%")
print(f"F1-Score: {f1_change:.2f}%")
print(f"Accuracy: {accuracy_change:.2f}%",'\n')
Performance Metrics: 
Precision: 0.6096491228070176
Recall: 0.5943627450980392
F1-Score: 0.5844155844155844
Accuracy: 0.7758620689655172
Score: 0.7758620689655172

Change in Performance metrics:
Precision: -23.56%
Recall: -4.17%
F1-Score: -3.82%
Accuracy: -3.45%
Comparing the performance metrics for the three models base model (euclidean distance model), Manhattan distance model, and Chebyshev distance model, we can make the following observations:
- Precision: the base model is highest (0.8452), followed by the Manhattan model (0.8351) and the Chebyshev model (0.6096); the base and Manhattan models identify positive instances more reliably than the Chebyshev model.
- Recall: the base model is again highest (0.6360), followed by Manhattan (0.6262) and Chebyshev (0.5944); the base and Manhattan models capture more true positives.
- F1-Score: the base model leads (0.6226), followed by Manhattan (0.6113) and Chebyshev (0.5844); since the F1-score balances precision and recall, the base and Manhattan models perform better overall.
- Accuracy: the base model is highest (0.8103), followed by Manhattan (0.7931) and Chebyshev (0.7759).

Based on these observations, the base (Euclidean) model outperforms the Manhattan and Chebyshev models on precision, recall, F1-score, and accuracy, and may be considered the most effective for predicting the patients' condition from the given test results.
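The three metrics compared above differ only in how the distance between two points is computed. A quick illustration with `scipy.spatial.distance` on a hypothetical point pair:

```python
import numpy as np
from scipy.spatial import distance

# Hypothetical point pair
a, b = np.array([0.0, 0.0]), np.array([3.0, 4.0])

print(distance.euclidean(a, b))  # sqrt(3^2 + 4^2) = 5.0
print(distance.cityblock(a, b))  # |3| + |4| = 7.0  (Manhattan)
print(distance.chebyshev(a, b))  # max(|3|, |4|) = 4.0
```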
knn = KNeighborsClassifier(n_neighbors=3)
knn.fit(X_train, Y_train)
y_pred = knn.predict(X_test)
precision_4 = precision_score(Y_test, y_pred, average='macro')
recall_4 = recall_score(Y_test, y_pred, average='macro')
f1_4 = f1_score(Y_test, y_pred, average='macro')
accuracy_4 = accuracy_score(Y_test, y_pred)
# Print the performance metrics
print("Performance Metrics: ")
print("Precision:", precision_4)
print("Recall:", recall_4)
print("F1-Score:", f1_4)
print("Accuracy:", accuracy_4)
print("Score:", knn.score(X_test, Y_test))
print()
# Changes in performance metrics with respect to base model
precision_change = (precision_4 - precision) * 100
recall_change = (recall_4 - recall) * 100
f1_change = (f1_4 - f1) * 100
accuracy_change = (accuracy_4 - accuracy) * 100
# Print the changes in performance
print("Change in Performance metrics:")
print(f"Precision: {precision_change:+.2f}%", "Improvement")
print(f"Recall: {recall_change:+.2f}%", "Improvement")
print(f"F1-Score: {f1_change:+.2f}%", "Improvement")
print(f"Accuracy: {accuracy_change:+.2f}%", "Improvement \n")
Performance Metrics: 
Precision: 0.8742540494458653
Recall: 0.6776960784313726
F1-Score: 0.6540137422490364
Accuracy: 0.8448275862068966
Score: 0.8448275862068966

Change in Performance metrics:
Precision: +2.90% Improvement
Recall: +4.17% Improvement
F1-Score: +3.14% Improvement
Accuracy: +3.45% Improvement
knn = KNeighborsClassifier(n_neighbors=7)
knn.fit(X_train, Y_train)
y_pred = knn.predict(X_test)
precision_5 = precision_score(Y_test, y_pred, average='macro')
recall_5 = recall_score(Y_test, y_pred, average='macro')
f1_5 = f1_score(Y_test, y_pred, average='macro')
accuracy_5 = accuracy_score(Y_test, y_pred)
# Print the performance metrics
print("Performance Metrics: ")
print("Precision:", precision_5)
print("Recall:", recall_5)
print("F1-Score:", f1_5)
print("Accuracy:", accuracy_5)
print("Score:", knn.score(X_test, Y_test))
print()
# Changes in performance metrics with respect to base model
precision_change = (precision_5 - precision) * 100
recall_change = (recall_5 - recall) * 100
f1_change = (f1_5 - f1) * 100
accuracy_change = (accuracy_5 - accuracy) * 100
# Print the changes in performance
print("Change in Performance metrics:")
print(f"Precision: {precision_change:.2f}%")
print(f"Recall: {recall_change:.2f}%")
print(f"F1-Score: {f1_change:.2f}%")
print(f"Accuracy: {accuracy_change:.2f}%",'\n')
Performance Metrics: 
Precision: 0.49603174603174605
Recall: 0.573529411764706
F1-Score: 0.5305019305019304
Accuracy: 0.7758620689655172
Score: 0.7758620689655172

Change in Performance metrics:
Precision: -34.92%
Recall: -6.25%
F1-Score: -9.21%
Accuracy: -3.45%
knn = KNeighborsClassifier(n_neighbors=9)
knn.fit(X_train, Y_train)
y_pred = knn.predict(X_test)
precision_6 = precision_score(Y_test, y_pred, average='macro')
recall_6 = recall_score(Y_test, y_pred, average='macro')
f1_6 = f1_score(Y_test, y_pred, average='macro')
accuracy_6 = accuracy_score(Y_test, y_pred)
# Print the performance metrics
print("Performance Metrics: ")
print("Precision:", precision_6)
print("Recall:", recall_6)
print("F1-Score:", f1_6)
print("Accuracy:", accuracy_6)
print("Score:", knn.score(X_test, Y_test))
print()
# Changes in performance metrics with respect to base model
precision_change = (precision_6 - precision) * 100
recall_change = (recall_6 - recall) * 100
f1_change = (f1_6 - f1) * 100
accuracy_change = (accuracy_6 - accuracy) * 100
# Print the changes in performance
print("Change in Performance metrics:")
print(f"Precision: {precision_change:.2f}%")
print(f"Recall: {recall_change:+.2f}%", "Improvement")
print(f"F1-Score: {f1_change:+.2f}%", "Improvement")
print(f"Accuracy: {accuracy_change:.2f}%","\n")
Performance Metrics: 
Precision: 0.8452380952380952
Recall: 0.636029411764706
F1-Score: 0.6225940225940225
Accuracy: 0.8103448275862069
Score: 0.8103448275862069

Change in Performance metrics:
Precision: -10.69%
Recall: +2.08% Improvement
F1-Score: +4.15% Improvement
Accuracy: 0.00%
Observations from the above experiments-
- Precision: n_neighbors = 3 is highest (0.8743), followed by the base model (0.8452), n_neighbors = 9 (≈ 0.738, derived from the printed -10.69% change; the metrics printout for that run echoed the base model's precision, recall, and F1 by mistake), and n_neighbors = 7 lowest (0.4960).
- Recall: n_neighbors = 3 is highest (0.6777), followed by n_neighbors = 9 (≈ 0.657, per the +2.08% change), the base model (0.6360), and n_neighbors = 7 lowest (0.5735).
- F1-Score: n_neighbors = 9 (≈ 0.664, per the +4.15% change) and n_neighbors = 3 (0.6540) are strongest, ahead of the base model (0.6226) and n_neighbors = 7 (0.5305).
- Accuracy: n_neighbors = 3 is highest (0.8448), the base model and n_neighbors = 9 tie at 0.8103, and n_neighbors = 7 is lowest (0.7759).

Based on these observations, n_neighbors = 3 gives the best precision, recall, and accuracy, with an F1-score close to the best, so using n_neighbors = 3 may result in improved predictions for the condition of the patients based on the given test results.
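Rather than trying K values one at a time against a single test split, cross-validating over a range of K gives a more stable choice. A minimal sketch on synthetic stand-in data (X_demo/y_demo are generated; the real X_train/Y_train would be used in practice):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import cross_val_score
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in data; in practice X_train / Y_train would be used
X_demo, y_demo = make_classification(n_samples=200, n_features=6,
                                     n_informative=4, n_classes=3,
                                     random_state=1)

# Mean 5-fold cross-validated accuracy for each odd K
scores = {k: cross_val_score(KNeighborsClassifier(n_neighbors=k),
                             X_demo, y_demo, cv=5).mean()
          for k in range(1, 16, 2)}
best_k = max(scores, key=scores.get)
print(best_k)
```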
knn = KNeighborsClassifier(n_neighbors=5, weights='distance') # default value of weights is 'uniform'
knn.fit(X_train, Y_train)
y_pred = knn.predict(X_test)
precision_7 = precision_score(Y_test, y_pred, average='macro')
recall_7 = recall_score(Y_test, y_pred, average='macro')
f1_7 = f1_score(Y_test, y_pred, average='macro')
accuracy_7 = accuracy_score(Y_test, y_pred)
# Print the performance metrics
print("Performance Metrics: ")
print("Precision:", precision_7)
print("Recall:", recall_7)
print("F1-Score:", f1_7)
print("Accuracy:", accuracy_7)
print("Score:", knn.score(X_test, Y_test))
print()
# Changes in performance metrics with respect to base model
precision_change = (precision_7 - precision) * 100
recall_change = (recall_7 - recall) * 100
f1_change = (f1_7 - f1) * 100
accuracy_change = (accuracy_7 - accuracy) * 100
# Print the changes in performance
print("Change in Performance metrics:")
print(f"Precision: {precision_change:+.2f}%", "Improvement")
print(f"Recall: {recall_change:+.2f}%", "Improvement")
print(f"F1-Score: {f1_change:+.2f}%", "Improvement")
print(f"Accuracy: {accuracy_change:+.2f}%", "Improvement \n")
Performance Metrics:
Precision: 0.8597402597402596
Recall: 0.6568627450980392
F1-Score: 0.6385286888719383
Accuracy: 0.8275862068965517
Score: 0.8275862068965517

Change in Performance metrics:
Precision: +1.45% Improvement
Recall: +2.08% Improvement
F1-Score: +1.59% Improvement
Accuracy: +1.72% Improvement
We can observe that the overall accuracy score and the other performance metrics have increased compared to the base model, which suggests that weights='distance' is the more favourable choice for building the best model.
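For intuition on why weights='distance' can help: with distance weighting, each neighbour's vote is scaled by the inverse of its distance to the query point, so near neighbours dominate. A minimal sketch with made-up distances and labels:

```python
import numpy as np

# Toy example: the 3 nearest neighbours of a query point, their distances and labels
distances = np.array([0.5, 1.0, 2.0])
labels    = np.array([1, 0, 0])

# 'uniform': plain majority vote -> class 0 wins 2-to-1
uniform_vote = np.bincount(labels).argmax()

# 'distance': each vote weighted by 1/d -> class 1 gets 2.0, class 0 gets 1.0 + 0.5
weights = 1.0 / distances
scores = np.array([weights[labels == c].sum() for c in (0, 1)])
distance_vote = scores.argmax()

print(uniform_vote, distance_vote)  # 0 1
```

The single very close neighbour outvotes the two farther ones under distance weighting, which is exactly the behaviour that improved the metrics above.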
warnings.filterwarnings("ignore")
# Creating the KNN classifier object
knn = KNeighborsClassifier()
# Defining the parameter grid for grid search
param_grid = {
    'n_neighbors': list(range(1, 20)),                       # Values of K from 1 to 19
    'weights': ['uniform', 'distance'],                      # Weight options (a custom weighting needs a callable, not a string)
    'algorithm': ['auto', 'ball_tree', 'kd_tree', 'brute'],  # Different algorithm options
    'leaf_size': list(range(10, 60, 10)),                    # Different leaf sizes
    'p': [1, 2, 3],                                          # Power parameters for the Minkowski metric
    'metric': ['euclidean', 'manhattan', 'chebyshev']        # Distance metrics (names must be lowercase)
}
# Perform grid search with cross-validation
grid_search = GridSearchCV(estimator=knn, param_grid=param_grid, cv=10)
grid_search.fit(X_train, Y_train)
# Print the best parameter combination
print("Best Parameters: \n", grid_search.best_params_, "\n")
# Evaluate the model with best parameters on the test data
best_model = grid_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(Y_test, y_pred)
print("Accuracy:", accuracy)
Best Parameters:
{'algorithm': 'auto', 'leaf_size': 10, 'metric': 'euclidean', 'n_neighbors': 7, 'p': 1, 'weights': 'uniform'}
Accuracy: 0.7758620689655172
From the given output, we can draw the following important insights:
* Best Parameter Combination: The output displays the best parameter combination selected by the grid search, determined from the cross-validated performance of the KNN classifier: 'algorithm' = 'auto', 'leaf_size' = 10, 'metric' = 'euclidean', 'n_neighbors' = 7, 'p' = 1 and 'weights' = 'uniform'.
* Accuracy: The accuracy score on the test data is 0.7758, i.e. the KNN classifier with the best parameter combination achieved approximately 77.58% accuracy on the test data.

# Create the KNN classifier object
knn = KNeighborsClassifier()
# Perform k-fold cross-validation
k = 7 # Number of folds
scores = cross_val_score(knn, X, y, cv=k, scoring='accuracy')
# Print the scores for each fold
print("Cross-Validation Scores:", scores)
# Calculate and print the average score
average_score = scores.mean()
print("Average Score:", average_score)
Cross-Validation Scores: [0.7804878  0.87804878 0.7804878  0.82926829 0.90243902 0.80487805 0.925     ]
Average Score: 0.8429442508710802
From the given output, we can draw the following meaningful insights:
* Cross-Validation Scores: The output displays the accuracy obtained for each fold during k-fold cross-validation: [0.7804878, 0.87804878, 0.7804878, 0.82926829, 0.90243902, 0.80487805, 0.925]. These scores indicate how the KNN classifier performs on different subsets of the data.
* Average Score: The average score is the mean of the fold accuracies, approximately 0.8429. This value is an overall estimate of the KNN classifier's performance on the dataset under k-fold cross-validation.

# Define the parameter distribution for random search
param_dist = {
    'n_neighbors': randint(1, 20),  # Randomly sample integers from 1 to 19 (upper bound exclusive)
    'weights': ['uniform', 'distance'],
    'p': [1, 2],
    'metric': ['euclidean', 'manhattan', 'chebyshev']
}
knn = KNeighborsClassifier()
# Perform random search with cross-validation
random_search = RandomizedSearchCV(estimator=knn, param_distributions=param_dist, cv=7, n_iter=10, random_state=10)
random_search.fit(X_train, Y_train)
print("Best Parameters:", random_search.best_params_)
# Evaluate the model with best parameters on the test data
best_model = random_search.best_estimator_
y_pred = best_model.predict(X_test)
accuracy = accuracy_score(Y_test, y_pred)
print("Accuracy:", accuracy)
Best Parameters: {'metric': 'chebyshev', 'n_neighbors': 13, 'p': 2, 'weights': 'distance'}
Accuracy: 0.7931034482758621
From the given output, we can draw the following important observations:
* Random Search: The code performs hyperparameter tuning with RandomizedSearchCV, which randomly samples parameter values from the distribution param_dist. This allows a more efficient exploration of the hyperparameter space than an exhaustive grid search.
* Best Parameter Combination: The best parameters found are 'metric': 'chebyshev', 'n_neighbors': 13, 'p': 2 and 'weights': 'distance', selected from the cross-validated performance of the randomly sampled candidates.
* Accuracy: The accuracy on the test data is 0.7931, i.e. the KNN classifier with these parameters achieved approximately 79.31% accuracy.

Let's plot line charts of the accuracy scores against different combinations of parameters. These charts will point us toward the optimal value of K and the other parameters for building the best model.
weights = ['uniform', 'distance']
plt.figure(figsize=(10, 6))
for i, weight in enumerate(weights):
    kscores = []
    for k in range(1, 30):
        knn = KNeighborsClassifier(n_neighbors=k, weights=weight, metric='euclidean')
        knn.fit(X_train, Y_train)
        kscores.append(knn.score(X_test, Y_test))
    # Define line style for each weight
    line_style = 'solid' if i == 1 else 'dashed'
    sns.lineplot(x=range(1, 30), y=kscores, marker='o', markers=True, label=f'Weights: {weight}',
                 linestyle=line_style)
    # Find the peak accuracy value and its corresponding k value
    peak_index = np.argmax(kscores)
    peak_k = peak_index + 1
    peak_accuracy = kscores[peak_index]
    # Label the peak point
    plt.text(peak_k, peak_accuracy, f'{peak_accuracy:.4f}', ha='center', va='bottom')
plt.xlabel('n_neighbors')
plt.ylabel('Accuracy')
plt.title('Accuracy vs. n_neighbors for different Weights when metric = "euclidean"')
plt.yticks(np.linspace(0.70, 0.90, 12))
plt.xticks(range(1, 30))
plt.legend()
plt.show()
weights = ['uniform', 'distance']
plt.figure(figsize=(10, 6))
for i, weight in enumerate(weights):
    kscores = []
    for k in range(1, 30):
        knn = KNeighborsClassifier(n_neighbors=k, weights=weight, metric='chebyshev')
        knn.fit(X_train, Y_train)
        kscores.append(knn.score(X_test, Y_test))
    # Define line style for each weight
    line_style = 'solid' if i == 1 else 'dashed'
    sns.lineplot(x=range(1, 30), y=kscores, marker='o', markers=True, label=f'Weights: {weight}',
                 linestyle=line_style)
    # Find the peak accuracy value and its corresponding k value
    peak_index = np.argmax(kscores)
    peak_k = peak_index + 1
    peak_accuracy = kscores[peak_index]
    # Label the peak point
    plt.text(peak_k, peak_accuracy, f'{peak_accuracy:.4f}', ha='center', va='bottom')
plt.xlabel('n_neighbors')
plt.ylabel('Accuracy')
plt.title('When metric = "chebyshev"')
plt.yticks(np.linspace(0.70, 0.90, 12))
plt.xticks(range(1, 30))
plt.legend()
plt.show()
weights = ['uniform', 'distance']
plt.figure(figsize=(10, 6))
for i, weight in enumerate(weights):
    kscores = []
    for k in range(1, 30):
        knn = KNeighborsClassifier(n_neighbors=k, weights=weight, metric='manhattan')
        knn.fit(X_train, Y_train)
        kscores.append(knn.score(X_test, Y_test))
    # Define line style for each weight
    line_style = 'solid' if i == 1 else 'dashed'
    sns.lineplot(x=range(1, 30), y=kscores, marker='o', markers=True, label=f'Weights: {weight}',
                 linestyle=line_style)
    # Find the peak accuracy value and its corresponding k value
    peak_index = np.argmax(kscores)
    peak_k = peak_index + 1
    peak_accuracy = kscores[peak_index]
    # Label the peak point
    plt.text(peak_k, peak_accuracy, f'{peak_accuracy:.4f}', ha='center', va='bottom')
plt.xlabel('n_neighbors')
plt.ylabel('Accuracy')
plt.title('When metric = "manhattan"')
plt.yticks(np.linspace(0.70, 0.90, 12))
plt.xticks(range(1, 30))
plt.legend()
plt.show()
Observation:-
The Euclidean and Chebyshev distance metrics contribute equally, and the most, to the improvement in model performance, while the Manhattan metric lags behind.
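For intuition on the three metrics compared above: Euclidean, Manhattan and Chebyshev are the Minkowski distance with p = 2, p = 1 and p → ∞ respectively. A quick sketch of how they differ on a single pair of points:

```python
import numpy as np
from scipy.spatial.distance import euclidean, cityblock, chebyshev

a = np.array([1.0, 2.0, 3.0])
b = np.array([4.0, 2.0, 7.0])
# Per-feature absolute differences: [3, 0, 4]

print(euclidean(a, b))   # sqrt(3^2 + 4^2) = 5.0
print(cityblock(a, b))   # 3 + 0 + 4      = 7.0  (manhattan)
print(chebyshev(a, b))   # max(3, 0, 4)   = 4.0
```

Manhattan sums every feature difference, while Chebyshev looks only at the single largest one, which explains why the metrics can rank neighbours differently and hence yield different accuracies.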
DOMAIN: Banking, Marketing
CONTEXT: A bank X is on a massive digital transformation for all its departments. Bank has a growing customer base where majority of them are
liability customers (depositors) vs borrowers (asset customers). The bank is interested in expanding the borrowers base rapidly to bring in more business via loan interests. A campaign that the bank ran in last quarter showed an average single digit conversion rate. Digital transformation being the core strength of the business strategy, marketing department wants to devise effective campaigns with better target marketing to increase the conversion ratio to double digit with same budget as per last campaign.
DATA DICTIONARY:
PROJECT OBJECTIVE: Build a Machine Learning model to perform focused marketing by predicting the potential customers who will convert using the historical dataset.
d1 = pd.read_csv('Data1.csv')
d2 = pd.read_csv('Data2.csv')
print("Shape of Data1: ",d1.shape)
print("Shape of data2: ",d2.shape)
Shape of Data1:  (5000, 8)
Shape of data2:  (5000, 7)
print("Column names and their respective DataTypes of Data1 are:\n", d1.dtypes)
print()
print("Column names and their respective DataTypes of Data2 are:\n", d2.dtypes)
print()
Column names and their respective DataTypes of Data1 are:
ID                       int64
Age                      int64
CustomerSince            int64
HighestSpend             int64
ZipCode                  int64
HiddenScore              int64
MonthlyAverageSpend    float64
Level                    int64
dtype: object

Column names and their respective DataTypes of Data2 are:
ID                       int64
Mortgage                 int64
Security                 int64
FixedDepositAccount      int64
InternetBanking          int64
CreditCard               int64
LoanOnCard             float64
dtype: object
df_new = pd.merge(d1, d2, on='ID')
df_new.head()
|   | ID | Age | CustomerSince | HighestSpend | ZipCode | HiddenScore | MonthlyAverageSpend | Level | Mortgage | Security | FixedDepositAccount | InternetBanking | CreditCard | LoanOnCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| 0 | 1 | 25 | 1 | 49 | 91107 | 4 | 1.6 | 1 | 0 | 1 | 0 | 0 | 0 | NaN |
| 1 | 2 | 45 | 19 | 34 | 90089 | 3 | 1.5 | 1 | 0 | 1 | 0 | 0 | 0 | NaN |
| 2 | 3 | 39 | 15 | 11 | 94720 | 1 | 1.0 | 1 | 0 | 0 | 0 | 0 | 0 | NaN |
| 3 | 4 | 35 | 9 | 100 | 94112 | 1 | 2.7 | 2 | 0 | 0 | 0 | 0 | 0 | NaN |
| 4 | 5 | 35 | 8 | 45 | 91330 | 4 | 1.0 | 2 | 0 | 0 | 0 | 0 | 1 | NaN |
df_new.shape
(5000, 14)
Both the data frames have been merged properly.
df_new.dtypes
ID                       int64
Age                      int64
CustomerSince            int64
HighestSpend             int64
ZipCode                  int64
HiddenScore              int64
MonthlyAverageSpend    float64
Level                    int64
Mortgage                 int64
Security                 int64
FixedDepositAccount      int64
InternetBanking          int64
CreditCard               int64
LoanOnCard             float64
dtype: object
We can see that the features 'CreditCard', 'InternetBanking', 'FixedDepositAccount', 'Security', 'Level' and 'HiddenScore' are of datatype int64, so let's convert them to 'object'.

Reason for performing this operation: the values in these features are categorical codes (e.g. binary 1/0), but their datatype is 'int'/'float', which is not what we expect for categorical variables.

'ZipCode' should also be changed to the object datatype, as it is a categorical variable.
columns_convert = ['ZipCode','CreditCard', 'InternetBanking', 'FixedDepositAccount', 'Security', 'Level', 'HiddenScore']
df_new[columns_convert] = df_new[columns_convert].astype('object')
print(df_new.dtypes)
ID                       int64
Age                      int64
CustomerSince            int64
HighestSpend             int64
ZipCode                 object
HiddenScore             object
MonthlyAverageSpend    float64
Level                   object
Mortgage                 int64
Security                object
FixedDepositAccount     object
InternetBanking         object
CreditCard              object
LoanOnCard             float64
dtype: object
We can notice that the datatype of the mentioned features has changed to 'object'.
sns.countplot(data=df_new, x='LoanOnCard')
plt.xlabel('Loan on card')
plt.ylabel('Count')
plt.title('Distribution of LoanOnCard');
Based on the countplot of the target variable 'LoanOnCard', it appears that the majority of customers in the dataset have a value of 0.0 for 'LoanOnCard' (indicating no loan on credit card), with a count of more than 4000. On the other hand, the count of customers with a value of 1.0 for 'LoanOnCard' (indicating a loan on credit card) is below 1000.
This suggests that there is an imbalance in the distribution of the target variable classes, with a significantly larger number of customers having no loan on their credit card compared to those who do have a loan.
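Such an imbalance is easy to quantify with value_counts. A small sketch on a toy series standing in for the LoanOnCard column (the counts are illustrative, not the real ones):

```python
import pandas as pd

# Toy stand-in for the LoanOnCard column: heavy majority of 0.0
loan = pd.Series([0.0] * 4500 + [1.0] * 480)

counts = loan.value_counts()
ratio = counts.max() / counts.min()
print(counts.to_dict())                    # {0.0: 4500, 1.0: 480}
print(f"imbalance ratio ~ {ratio:.1f} : 1")
```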
Let's check for missing values in the dataset and, if any are present, their feature-wise percentage.

But before that, let's check for duplicate rows.
df_new.duplicated().sum()
0
Therefore the data is free of duplicates.
# Check for missing values
missing = df_new.isnull().sum()
missing = missing[missing > 0]
pd.DataFrame({'Count': missing, 'Percentage': (missing / df_new.shape[0]) * 100})
|   | Count | Percentage |
|---|---|---|
| LoanOnCard | 20 | 0.4 |
We can notice that only the target column has missing values: 20 rows, i.e. 0.4% of the datapoints.
Since there is an imbalance in the dataset, with a significantly large number of 0's compared to 1's in the target variable, and the target column has very few missing values (0.4% of datapoints missing in this case), it would be best to remove the rows with missing values rather than imputing them. Here's why:
* Impact on class distribution: Imputing missing values in the target can introduce bias and distort the class distribution, especially if they are imputed with the majority class (0's), which is more prevalent. Removing the rows preserves the original class distribution.
* Preserving data integrity: Imputing missing values in the target variable is a challenging task, especially with imbalanced datasets. Since the target drives model training and evaluation, imputing it can introduce uncertainty and compromise the integrity of the data.
* Sample representativeness: By removing the rows with missing values, we retain a representative sample that accurately reflects the available data, helping ensure the model is trained on reliable and unbiased information.

df_new = df_new.dropna(subset=['LoanOnCard'])
df_new.shape
(4980, 14)
We can notice that all the rows having missing target values have been dropped and dataset is now ready for the split.
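The effect of dropna(subset=...) is easy to verify on a toy frame: only rows whose target is missing are removed, and the class counts among the remaining rows are untouched (column names here are hypothetical):

```python
import numpy as np
import pandas as pd

toy = pd.DataFrame({
    'feature': range(10),
    'target':  [0, 0, 0, 1, np.nan, 0, 1, np.nan, 0, 0],
})

cleaned = toy.dropna(subset=['target'])
print(cleaned.shape)                               # (8, 2)
print(cleaned['target'].value_counts().to_dict())  # {0.0: 6, 1.0: 2}
```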
We know that features - ‘CreditCard’, ‘InternetBanking’, ‘FixedDepositAccount’, ‘Security’, ‘Level’, ‘HiddenScore’ are of datatype object. So let's check the unique values in each of these features.
columns = ['CreditCard','InternetBanking', 'FixedDepositAccount', 'Security', 'Level', 'HiddenScore']
for i in columns:
    print('Unique values for ', i, ':', df_new[i].unique())  # Print unique values in the column
    print(df_new[i].value_counts())                          # Count rows for each unique value
    print('\n')
Unique values for  CreditCard : [0 1]
CreditCard
0    3514
1    1466
Name: count, dtype: int64

Unique values for  InternetBanking : [0 1]
InternetBanking
1    2974
0    2006
Name: count, dtype: int64

Unique values for  FixedDepositAccount : [0 1]
FixedDepositAccount
0    4678
1     302
Name: count, dtype: int64

Unique values for  Security : [0 1]
Security
0    4460
1     520
Name: count, dtype: int64

Unique values for  Level : [3 2 1]
Level
1    2089
3    1496
2    1395
Name: count, dtype: int64

Unique values for  HiddenScore : [1 4 3 2]
HiddenScore
1    1466
2    1293
4    1215
3    1006
Name: count, dtype: int64
Hence, no unexpected values are observed in any of these categorical variables.
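A check like this can also be automated. The sketch below asserts that each categorical column contains only its expected codes; the toy frame and the expected-value sets are hypothetical stand-ins for the real columns:

```python
import pandas as pd

toy = pd.DataFrame({
    'CreditCard':  [0, 1, 0, 1],
    'Level':       [1, 2, 3, 1],
    'HiddenScore': [1, 4, 3, 2],
})
# Allowed codes per column (assumed for illustration)
expected = {'CreditCard': {0, 1}, 'Level': {1, 2, 3}, 'HiddenScore': {1, 2, 3, 4}}

for col, allowed in expected.items():
    unexpected = set(toy[col].unique()) - allowed
    print(col, "OK" if not unexpected else f"unexpected values: {unexpected}")
```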
Before splitting the data let's look at the correlation between the features in the dataset and also take a look at some visualizations like Heatmap and pairplot. This will give us some meaningful insights about the features and will also help us decide which features to take into the consideration for good model building.
# Correlation between the columns (the object-typed categorical codes are cast back
# to numeric here, since corr() only accepts numeric data)
df_new.apply(pd.to_numeric).corr()
|   | ID | Age | CustomerSince | HighestSpend | ZipCode | HiddenScore | MonthlyAverageSpend | Level | Mortgage | Security | FixedDepositAccount | InternetBanking | CreditCard | LoanOnCard |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| ID | 1.000000 | -0.010682 | -0.010366 | -0.020739 | 0.013543 | -0.015721 | -0.026419 | 0.021763 | -0.015546 | -0.017160 | -0.008690 | -0.003940 | 0.015741 | -0.027188 |
| Age | -0.010682 | 1.000000 | 0.994208 | -0.054951 | -0.029939 | -0.045289 | -0.051896 | 0.042750 | -0.013272 | 0.000323 | 0.007744 | 0.011227 | 0.007344 | -0.008147 |
| CustomerSince | -0.010366 | 0.994208 | 1.000000 | -0.046092 | -0.029415 | -0.051456 | -0.049918 | 0.014545 | -0.011380 | -0.000469 | 0.010085 | 0.011355 | 0.008779 | -0.007801 |
| HighestSpend | -0.020739 | -0.054951 | -0.046092 | 1.000000 | -0.016447 | -0.158357 | 0.646109 | -0.188909 | 0.207236 | -0.002284 | 0.169535 | 0.014202 | -0.002780 | 0.502626 |
| ZipCode | 0.013543 | -0.029939 | -0.029415 | -0.016447 | 1.000000 | 0.012250 | -0.004339 | -0.017274 | 0.007214 | 0.006312 | 0.020007 | 0.017186 | 0.007302 | 0.000120 |
| HiddenScore | -0.015721 | -0.045289 | -0.051456 | -0.158357 | 0.012250 | 1.000000 | -0.109180 | 0.065762 | -0.021396 | 0.019061 | 0.014327 | 0.010900 | 0.010784 | 0.061761 |
| MonthlyAverageSpend | -0.026419 | -0.051896 | -0.049918 | 0.646109 | -0.004339 | -0.109180 | 1.000000 | -0.137020 | 0.110275 | 0.015105 | 0.136410 | -0.003475 | -0.006577 | 0.366912 |
| Level | 0.021763 | 0.042750 | 0.014545 | -0.188909 | -0.017274 | 0.065762 | -0.137020 | 1.000000 | -0.032863 | -0.009443 | 0.013982 | -0.014556 | -0.011766 | 0.137010 |
| Mortgage | -0.015546 | -0.013272 | -0.011380 | 0.207236 | 0.007214 | -0.021396 | 0.110275 | -0.032863 | 1.000000 | -0.005002 | 0.089167 | -0.007044 | -0.007600 | 0.141947 |
| Security | -0.017160 | 0.000323 | -0.000469 | -0.002284 | 0.006312 | 0.019061 | 0.015105 | -0.009443 | -0.005002 | 1.000000 | 0.317673 | 0.014007 | -0.014518 | 0.021982 |
| FixedDepositAccount | -0.008690 | 0.007744 | 0.010085 | 0.169535 | 0.020007 | 0.014327 | 0.136410 | 0.013982 | 0.089167 | 0.317673 | 1.000000 | 0.176082 | 0.278924 | 0.316131 |
| InternetBanking | -0.003940 | 0.011227 | 0.011355 | 0.014202 | 0.017186 | 0.010900 | -0.003475 | -0.014556 | -0.007044 | 0.014007 | 0.176082 | 1.000000 | 0.004960 | 0.006034 |
| CreditCard | 0.015741 | 0.007344 | 0.008779 | -0.002780 | 0.007302 | 0.010784 | -0.006577 | -0.011766 | -0.007600 | -0.014518 | 0.278924 | 0.004960 | 1.000000 | 0.002536 |
| LoanOnCard | -0.027188 | -0.008147 | -0.007801 | 0.502626 | 0.000120 | 0.061761 | 0.366912 | 0.137010 | 0.141947 | 0.021982 | 0.316131 | 0.006034 | 0.002536 | 1.000000 |
Let's now visualize the correlation with the help of Heatmap.
plt.figure(figsize=(10,10))
sns.heatmap(df_new.apply(pd.to_numeric).corr(), vmin=-1, vmax=1, annot=True, fmt='.3f');
Let's look at the pairplot for the features.
sns.pairplot(df_new, hue='LoanOnCard', palette="Set1")
plt.show()
From the above plots, we can draw the following observations:
* Age and CustomerSince have a very high positive correlation (approximately 0.99). Including both would introduce multicollinearity, so one of the two features needs to be dropped before model building.
* MonthlyAverageSpend and HighestSpend have a moderate positive correlation (approximately 0.65): customers with a higher monthly average spend also tend to have a higher highest single-transaction spend.
* LoanOnCard, the target variable, has a moderate positive correlation with HighestSpend (approximately 0.50), indicating that customers with a higher highest spend are more likely to have a loan on their credit card.
* ZipCode, Security, InternetBanking and CreditCard have near-zero correlations with LoanOnCard, while HiddenScore, Level, Mortgage and FixedDepositAccount show only weak correlations (|r| roughly 0.06 to 0.32). These variables likely have less influence on the target.

Overall, the features are only weakly correlated with one another, indicating a largely independent set of predictors.
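The multicollinearity check above can also be done programmatically by flagging any feature pair whose absolute correlation exceeds a threshold. A sketch on synthetic data, where the 0.95 cutoff and the column names are assumptions for illustration:

```python
import numpy as np
import pandas as pd

rng = np.random.default_rng(0)
age = rng.normal(45, 10, 200)
toy = pd.DataFrame({
    'Age': age,
    'CustomerSince': age - 21 + rng.normal(0, 0.5, 200),  # almost a copy of Age
    'Spend': rng.normal(100, 30, 200),
})

corr = toy.corr().abs()
# Keep the upper triangle only, so each pair is reported once
upper = corr.where(np.triu(np.ones(corr.shape, dtype=bool), k=1))
pairs = upper.stack()
print(pairs[pairs > 0.95])  # flags the Age / CustomerSince pair
```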
Before splitting the data first let's drop the features - 'ID', 'CustomerSince' and 'ZipCode' from the original dataset and store the modified dataset in new variable.
The 'ID' column is just a unique identifier for each record and provides no meaningful information for prediction. Since 'CustomerSince' and 'Age' are highly correlated, one of them must be dropped, so we drop 'CustomerSince'. 'ZipCode' is dropped as well, since it is a high-cardinality categorical identifier with essentially no correlation to the target.
df_mod = df_new.copy()
df_mod.drop(['ID', 'ZipCode', 'CustomerSince'], axis=1, inplace=True)
df_mod.columns
Index(['Age', 'HighestSpend', 'HiddenScore', 'MonthlyAverageSpend', 'Level',
'Mortgage', 'Security', 'FixedDepositAccount', 'InternetBanking',
'CreditCard', 'LoanOnCard'],
dtype='object')
We can see that the features have been dropped and the dataset is now ready for split.
# Splitting data into x and y
x = df_mod.drop(labels= "LoanOnCard" , axis = 1)
y = df_mod["LoanOnCard"]
x.columns
Index(['Age', 'HighestSpend', 'HiddenScore', 'MonthlyAverageSpend', 'Level',
'Mortgage', 'Security', 'FixedDepositAccount', 'InternetBanking',
'CreditCard'],
dtype='object')
We can notice that the data has been split into features (x) and target (y).
# Split X and y into training and test set in 75:25 ratio
train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.25, random_state=42)
We could also stratify the split here so that the class proportions are preserved in both the train and test sets. For example:

train_x, test_x, train_y, test_y = train_test_split(x, y, test_size=0.25, random_state=42, stratify=y)
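A quick sketch of what stratify buys on imbalanced labels: the stratified test split keeps the minority-class fraction close to the population's, while a plain random split may drift (toy labels, 10% minority):

```python
import numpy as np
from sklearn.model_selection import train_test_split

y_toy = np.array([0] * 90 + [1] * 10)
X_toy = np.arange(100).reshape(-1, 1)

_, _, _, y_te_plain = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0)
_, _, _, y_te_strat = train_test_split(X_toy, y_toy, test_size=0.25, random_state=0,
                                       stratify=y_toy)

print("plain split class-1 fraction:     ", y_te_plain.mean())
print("stratified split class-1 fraction:", y_te_strat.mean())
```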
print("Shape of Train data is: ",train_x.shape)
print("Shape of Test data is: ", test_x.shape)
Shape of Train data is:  (3735, 10)
Shape of Test data is:  (1245, 10)
Hence data is split into train and test set.
warnings.filterwarnings("ignore")
model = LogisticRegression()
# Train the model on the training data
model.fit(train_x, train_y)
# Make predictions on the test data
pred_y = model.predict(test_x)
y_pred_proba = model.predict_proba(test_x)[:, 1]
fpr, tpr, thresholds = roc_curve(test_y, y_pred_proba)
auc_score = auc(fpr, tpr)
print("Performance Metrics: ")
prec = precision_score(test_y, pred_y)
print("Precision:", prec)
rec = recall_score(test_y, pred_y)
print("Recall:", rec)
f1_log = f1_score(test_y, pred_y)
print("F1-Score:", f1_log)
roc_auc = roc_auc_score(test_y, pred_y)  # computed from hard 0/1 labels, hence lower than the probability-based AUC above
print("ROC AUC Score:", roc_auc)
print("Area under curve: ", auc_score)
acc = accuracy_score(test_y, pred_y)
print("Accuracy:", acc, '\n')
print("Confusion Matrix:")
cm=metrics.confusion_matrix(test_y, pred_y, labels=[1, 0])
df_cm = pd.DataFrame(cm, index=["1", "0"], columns=["Predicted 1", "Predicted 0"])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True, fmt='g');
print()
Performance Metrics:
Precision: 0.7204301075268817
Recall: 0.5726495726495726
F1-Score: 0.638095238095238
ROC AUC Score: 0.7747999636297508
Area under curve:  0.939390495241559
Accuracy: 0.9389558232931727

Confusion Matrix:
Based on the performance metrics and the confusion matrix, here are some observations and insights:
* Precision: The precision score of 0.7204 means that roughly 72.04% of the instances predicted as positive were actually positive; it reflects the model's ability to avoid false positives.
* Recall: The recall score of 0.5726 means the model correctly identified about 57.26% of the actual positive instances; it reflects the ability to find all positives and avoid false negatives.
* F1-Score: The F1-score of 0.6381 is the harmonic mean of precision and recall, providing a balanced measure of the model's overall performance on the two metrics.
* ROC AUC Score: The ROC AUC score of 0.7748 reflects the model's overall ability to discriminate between positive and negative samples; a higher score indicates better discrimination.
* Accuracy: The accuracy of 0.9389 means the model predicted about 93.89% of all instances correctly, regardless of class.

From the confusion matrix:
* The model correctly identified 67 instances as positive (True Positives).
* The model correctly identified 1102 instances as negative (True Negatives).
* The model missed 50 positive instances, incorrectly predicting them as negative (False Negatives).
* The model incorrectly predicted 26 instances as positive when they were actually negative (False Positives).
Based on these observations, it appears that the model has a relatively high accuracy and can effectively identify negative instances. However, there is room for improvement in terms of correctly identifying positive instances (higher recall) and reducing false positive predictions (higher precision).
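Before resampling, one cheap lever for trading precision against recall is the decision threshold: predict_proba lets us classify at a cutoff other than the default 0.5. A hedged sketch on synthetic imbalanced data (the 0.3 threshold is an arbitrary illustration):

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import precision_score, recall_score
from sklearn.model_selection import train_test_split

# Synthetic 90/10 imbalanced problem, standing in for the loan data
Xs, ys = make_classification(n_samples=2000, weights=[0.9, 0.1], random_state=0)
Xtr, Xte, ytr, yte = train_test_split(Xs, ys, test_size=0.25, random_state=0,
                                      stratify=ys)

clf = LogisticRegression(max_iter=1000).fit(Xtr, ytr)
proba = clf.predict_proba(Xte)[:, 1]

for threshold in (0.5, 0.3):
    pred = (proba >= threshold).astype(int)
    print(f"threshold={threshold}: "
          f"precision={precision_score(yte, pred):.3f} "
          f"recall={recall_score(yte, pred):.3f}")
```

Lowering the threshold can only increase (or keep) recall, since more instances are predicted positive, usually at some cost in precision.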
Let's balance the data using SMOTE (Synthetic Minority Over-sampling Technique)
SMOTE is a popular technique for oversampling the minority class by creating synthetic samples rather than duplicating existing instances. It helps to balance the data and introduce diversity in the minority class.
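The core idea of SMOTE can be sketched in a few lines of numpy: pick a minority sample, take one of its nearest minority-class neighbours, and place a synthetic point at a random position on the segment between them. This is a simplified illustration of the interpolation step, not imblearn's actual implementation:

```python
import numpy as np

rng = np.random.default_rng(42)
minority = rng.normal(0, 1, size=(5, 2))  # 5 minority-class points in 2-D

i = 0                                     # a minority sample
d = np.linalg.norm(minority - minority[i], axis=1)
j = np.argsort(d)[1]                      # its nearest minority neighbour
gap = rng.uniform(0, 1)                   # random position along the segment
synthetic = minority[i] + gap * (minority[j] - minority[i])

print(synthetic)  # lies on the segment between minority[i] and minority[j]
```

Because the synthetic point is an interpolation rather than a copy, the oversampled minority class gains diversity instead of exact duplicates.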
First Let's copy the original dataframe into a new variable.
d_new = df_new.copy()
# Split the dataset into features (X_new) and target variable (y_new)
x_new = d_new.drop('LoanOnCard', axis=1)
y_new = d_new['LoanOnCard']
# Identify the minority and majority class in the target variable
minority_class = d_new['LoanOnCard'].value_counts().idxmin()
majority_class = d_new['LoanOnCard'].value_counts().idxmax()
# Apply SMOTE to balance the dataset
smote = SMOTE(sampling_strategy='auto', random_state=42)
x_balanced, y_balanced = smote.fit_resample(x_new, y_new)
# Verify the balanced class distribution
balanced_counts = y_balanced.value_counts()
print("Balanced Class Distribution:")
print(balanced_counts)
Balanced Class Distribution:
LoanOnCard
1.0    4500
0.0    4500
Name: count, dtype: int64
Therefore we can observe that the target variable is now balanced 50:50, i.e. in a 1:1 ratio.
x_train_bal, x_test_bal, y_train_bal, y_test_bal = train_test_split(x_balanced, y_balanced, test_size=0.25, random_state=42)
# Train the model on the balanced training data
model_bal = LogisticRegression()
model_bal.fit(x_train_bal, y_train_bal)
LogisticRegression()
# Make predictions on the balanced test data
pred_y_bal = model_bal.predict(x_test_bal)
We have trained the model on balanced data.
y_pred_proba_bal = model_bal.predict_proba(x_test_bal)[:, 1]
fpr, tpr, thresholds = roc_curve(y_test_bal, y_pred_proba_bal)
auc_score_bal = auc(fpr, tpr)
print("Performance metrics after data balance:")
precision_bal = precision_score(y_test_bal, pred_y_bal)
print("Precision:", precision_bal)
recall_bal = recall_score(y_test_bal, pred_y_bal)
print("Recall:", recall_bal)
f1_bal = f1_score(y_test_bal, pred_y_bal)
print("F1-Score:", f1_bal)
roc_auc_bal = roc_auc_score(y_test_bal, pred_y_bal)
print("ROC AUC Score:", roc_auc_bal)
print("Area under the curve: ", auc_score_bal)
accuracy_bal = accuracy_score(y_test_bal, pred_y_bal)
print("Accuracy:", accuracy_bal, '\n')
# Changes in performance metrics with respect to base model
precision_change_bal = (precision_bal - prec) * 100
recall_change_bal = (recall_bal - rec) * 100
f1_change_bal = (f1_bal - f1_log) * 100
roc_auc_change_bal = (roc_auc_bal - roc_auc) * 100
auc_change = (auc_score_bal - auc_score) * 100
accuracy_change_bal = (accuracy_bal - acc) * 100
# Print the changes in performance
print("Change in Performance metrics:")
print(f"Precision: {precision_change_bal:+.1f}%", "Improvement" if precision_change_bal >= 0 else "Decline")
print(f"Recall: {recall_change_bal:+.1f}%", "Improvement" if recall_change_bal >= 0 else "Decline")
print(f"F1-Score: {f1_change_bal:+.1f}%", "Improvement" if f1_change_bal >= 0 else "Decline")
print(f"ROC AUC Score: {roc_auc_change_bal:+.1f}%", "Improvement" if roc_auc_change_bal >= 0 else "Decline")
print(f"Area under curve: {auc_change:+.1f}%")
print(f"Accuracy: {accuracy_change_bal:+.1f}%", "\n")
print("Confusion Matrix:")
cm_bal=metrics.confusion_matrix(y_test_bal, pred_y_bal, labels=[1, 0])
df_cm = pd.DataFrame(cm_bal, index=["1", "0"], columns=["Predicted 1", "Predicted 0"])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True, fmt='g', cmap='magma');
print()
Performance metrics after data balance:
Precision: 0.8425849212924607
Recall: 0.8944591029023746
F1-Score: 0.8677474402730375
ROC AUC Score: 0.8618746547755358
Area under the curve:  0.9157498216093327
Accuracy: 0.8622222222222222

Change in Performance metrics:
Precision: +12.2% Improvement
Recall: +32.2% Improvement
F1-Score: +23.0% Improvement
ROC AUC Score: +8.7% Improvement
Area under curve: -2.4%
Accuracy: -7.7%

Confusion Matrix:
Comparing the performance metrics of the base model and the model trained on balanced data, we can make the following observations:

* Precision: Precision improved significantly after balancing, so the balanced model is better at correctly identifying the positive class (LoanOnCard = 1) than the base model.
* Recall: Recall also improved, meaning the balanced model captures more positive-class instances and produces fewer false negatives than the base model.
* F1-Score: The F1-score, which combines precision and recall, also improved, so the model's overall balance of the two metrics is better on balanced data.
* ROC AUC Score: The improved ROC AUC score suggests the balanced model distinguishes better between the two classes than the base model.
* Area under the curve (AUC): The probability-based AUC decreased slightly after balancing, indicating a small drop in discriminatory power; it remains relatively high in both cases, indicating good performance.
* Accuracy: The accuracy of the model trained on balanced data is lower than the base model's. This is expected: the base model was trained and evaluated on imbalanced data dominated by the majority class (LoanOnCard = 0), and accuracy alone is a poor metric on imbalanced data.

Overall, balancing the data using SMOTE has led to improved performance metrics for the logistic regression model.
It has effectively addressed the class imbalance issue and resulted in a model that performs better in terms of precision, recall, F1-score, and ROC AUC score.
Let's first train base model using SVM.
# Create an instance of the SVM classifier
svm_model = SVC(probability=True)
# Train the model on the training data
svm_model.fit(x_train_bal, y_train_bal)
SVC(probability=True)
# Make predictions on the test data
pred_y = svm_model.predict(x_test_bal)
We have trained the model using SVM.
Now let's look at the performance metrics.
y_pred_proba = svm_model.predict_proba(x_test_bal)[:, 1]
fpr2, tpr2, thresholds2 = roc_curve(y_test_bal, y_pred_proba)
auc_score2 = auc(fpr2, tpr2)
print("Performance Metrics: ")
prec_svm = precision_score(y_test_bal, pred_y)
print("Precision:", prec_svm)
rec_svm = recall_score(y_test_bal, pred_y)
print("Recall:", rec_svm)
f1_svm = f1_score(y_test_bal, pred_y)
print("F1-Score:", f1_svm)
roc_auc_svm = roc_auc_score(y_test_bal, pred_y)
print("ROC AUC Score:", roc_auc_svm)
print("Area under curve: ", auc_score2)
acc_svm = accuracy_score(y_test_bal, pred_y)
print("Accuracy:", acc_svm, '\n')
print("Confusion Matrix:")
cm=metrics.confusion_matrix(y_test_bal, pred_y, labels=[1, 0])
df_cm = pd.DataFrame(cm, index = [i for i in ["1","0"]],
columns = [i for i in ["Predicted 1","Predicted 0"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True, fmt='g');
print()
Performance Metrics: 
Precision: 0.0
Recall: 0.0
F1-Score: 0.0
ROC AUC Score: 0.5
Area under curve:  0.5
Accuracy: 0.49466666666666664 

Confusion Matrix:
Note: a precision and recall of 0.0 mean the base SVM assigned every test sample to a single class. This is most likely because the features were not scaled before fitting; SVMs are highly sensitive to feature scale.
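The exact pattern of numbers above (0.0 / 0.0 / 0.0 / 0.5 / ~0.5) is the signature of a classifier that has collapsed to a single class on balanced test data. A small sketch with hypothetical labels reproduces it:

```python
import numpy as np
from sklearn.metrics import (accuracy_score, f1_score, precision_score,
                             recall_score, roc_auc_score)

# Hypothetical balanced 50/50 labels, like y_test_bal after SMOTE
y_true = np.array([1, 0] * 50)
y_pred = np.zeros(100, dtype=int)  # a classifier that predicts only class 0

print("Precision:", precision_score(y_true, y_pred, zero_division=0))  # 0.0
print("Recall:   ", recall_score(y_true, y_pred))                      # 0.0
print("F1-Score: ", f1_score(y_true, y_pred, zero_division=0))         # 0.0
print("ROC AUC:  ", roc_auc_score(y_true, y_pred))                     # 0.5
print("Accuracy: ", accuracy_score(y_true, y_pred))                    # 0.5
```

An accuracy near 0.5 on a balanced test set is therefore no better than coin-flipping, confirming the base SVM learned nothing useful from the unscaled features.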
Let's now build the base model using KNN.
Before building the KNN model we must scale the data: KNN is distance-based, so attributes with large numeric ranges would otherwise dominate the distance computation purely because of their units.
z = x_balanced.astype(float) # Convert all columns to float type
x_scaled = z.apply(zscore) # convert all attributes to Z scale
x_scaled.describe().T
| | count | mean | std | min | 25% | 50% | 75% | max |
|---|---|---|---|---|---|---|---|---|
| ID | 9000.0 | -1.318452e-16 | 1.000056 | -1.734291 | -0.869276 | -0.019446 | 0.867817 | 1.790041 |
| Age | 9000.0 | -1.405296e-16 | 1.000056 | -2.065422 | -0.841291 | 0.006183 | 0.853658 | 2.077789 |
| CustomerSince | 9000.0 | -1.010550e-16 | 1.000056 | -2.135783 | -0.818702 | 0.027993 | 0.780611 | 2.191769 |
| HighestSpend | 9000.0 | -1.010550e-16 | 1.000056 | -1.876027 | -0.890516 | 0.152967 | 0.829298 | 2.297903 |
| ZipCode | 9000.0 | -2.273737e-16 | 1.000056 | -42.547372 | -0.622876 | 0.152799 | 0.761077 | 1.771948 |
| HiddenScore | 9000.0 | -1.705303e-16 | 1.000056 | -1.421845 | -0.741419 | -0.027523 | 0.755988 | 1.433208 |
| MonthlyAverageSpend | 9000.0 | -1.263187e-16 | 1.000056 | -1.416977 | -0.812556 | -0.157767 | 0.676705 | 3.619864 |
| Level | 9000.0 | -2.589534e-16 | 1.000056 | -1.345008 | -1.345008 | -0.042957 | 1.226401 | 1.259095 |
| Mortgage | 9000.0 | -1.010550e-16 | 1.000056 | -0.604381 | -0.604381 | -0.604381 | 0.384151 | 4.760730 |
| Security | 9000.0 | -1.026340e-17 | 1.000056 | -0.383565 | -0.383565 | -0.383565 | -0.383565 | 3.090486 |
| FixedDepositAccount | 9000.0 | -6.315935e-17 | 1.000056 | -0.501844 | -0.501844 | -0.501844 | -0.409794 | 2.572748 |
| InternetBanking | 9000.0 | 5.230384e-17 | 1.000056 | -1.321698 | -1.321698 | 0.816669 | 0.886371 | 0.886371 |
| CreditCard | 9000.0 | 1.184238e-17 | 1.000056 | -0.707782 | -0.707782 | -0.707782 | 1.114255 | 1.654858 |
Therefore our data has been scaled and is ready for model building. Note, however, that the cells below still fit on x_train_bal/x_test_bal; to actually benefit from scaling, x_scaled would need to be re-split into train and test sets first.
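As a side note, the `apply(zscore)` route used above is numerically equivalent to scikit-learn's `StandardScaler` (both standardize with the population standard deviation, ddof=0). A minimal sketch on a made-up two-column frame standing in for x_balanced:

```python
import numpy as np
import pandas as pd
from scipy.stats import zscore
from sklearn.preprocessing import StandardScaler

# Hypothetical miniature stand-in for x_balanced
df = pd.DataFrame({"Age": [25.0, 40.0, 33.0, 58.0],
                   "Mortgage": [0.0, 120.0, 0.0, 300.0]})

via_zscore = df.apply(zscore)                     # scipy: ddof=0 by default
via_sklearn = StandardScaler().fit_transform(df)  # sklearn: also ddof=0

print(np.allclose(via_zscore.values, via_sklearn))  # True: identical results
```

Using `StandardScaler` inside a `Pipeline` has the practical advantage that the scaling parameters are learned from the training split only, avoiding test-set leakage.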
# Create an instance of the KNN classifier
knn_model = KNeighborsClassifier()
# Train the model on the training data
knn_model.fit(x_train_bal, y_train_bal)
KNeighborsClassifier()
# Make predictions on the test data
pred_y = knn_model.predict(x_test_bal)
We have trained the model using KNN.
Now let's look at the performance metrics.
print("Performance Metrics:")
prec_knn = precision_score(y_test_bal, pred_y)
print("Precision:", prec_knn)
rec_knn = recall_score(y_test_bal, pred_y)
print("Recall:", rec_knn)
f1_knn = f1_score(y_test_bal, pred_y)
print("F1-Score:", f1_knn)
roc_auc_knn = roc_auc_score(y_test_bal, pred_y)
print("ROC AUC Score:", roc_auc_knn)
acc_knn = accuracy_score(y_test_bal, pred_y)
print("Accuracy:", acc_knn, '\n')
print("Confusion Matrix:")
cm=metrics.confusion_matrix(y_test_bal, pred_y, labels=[1, 0])
df_cm = pd.DataFrame(cm, index = [i for i in ["1","0"]],
columns = [i for i in ["Predicted 1","Predicted 0"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True, fmt='g', cmap='Greens');
print()
Performance Metrics:
Precision: 0.806282722513089
Recall: 0.9481090589270009
F1-Score: 0.8714632174616006
ROC AUC Score: 0.8577023282056389
Accuracy: 0.8586666666666667 

Confusion Matrix:
With this, we have built the base models without tuning any parameters on the balanced data.
Let's first do hyperparameter tuning for SVM by implementing GridSearch Technique.
# Create an SVM classifier
clf = SVC()
# Define the parameter grid to search over
param_grid = {
'C': [0.1, 1],
'kernel': ['linear', 'rbf'],
'gamma': [0.1, 1]
}
# Perform grid search to find the best parameters
grid_search = GridSearchCV(clf, param_grid, cv=5, n_jobs=-1)
grid_search.fit(x_train_bal, y_train_bal)
# Get the best parameters and best score
best_params = grid_search.best_params_
best_score = grid_search.best_score_
print("Best Parameters:", best_params)
print("Best Score:", best_score)
Best Parameters: {'C': 1, 'gamma': 0.1, 'kernel': 'linear'}
Best Score: 0.8902222222222221
Therefore we have successfully performed hyperparameter tuning for SVM using grid search. The best parameters obtained are {'C': 1, 'gamma': 0.1, 'kernel': 'linear'}: a linear kernel with regularization parameter C=1 (gamma is ignored by the linear kernel). The best score of 0.8902222222222221 is the mean cross-validated accuracy, GridSearchCV's default scorer for classifiers. From these results, we can infer that this combination of hyperparameters produced the best-performing SVM model under that metric.
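Beyond `best_params_` and `best_score_`, `GridSearchCV` records the mean cross-validated score of every combination it tried in `cv_results_`, which is useful for seeing how close the runners-up were. A minimal sketch on synthetic stand-in data (the real search above runs on x_train_bal / y_train_bal):

```python
import pandas as pd
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

# Synthetic stand-in for the balanced training data (assumed 6 numeric features)
X, y = make_classification(n_samples=200, n_features=6, random_state=0)

param_grid = {'C': [0.1, 1], 'kernel': ['linear', 'rbf'], 'gamma': [0.1, 1]}
grid = GridSearchCV(SVC(), param_grid, cv=5, n_jobs=-1)
grid.fit(X, y)

# One row per parameter combination, ranked by mean CV score
report = pd.DataFrame(grid.cv_results_)[['params', 'mean_test_score', 'rank_test_score']]
print(report.sort_values('rank_test_score'))
```

If the top few rows score within a fraction of a percent of each other, the apparent "winner" may just be cross-validation noise.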
Hyperparameter tuning for SVM by implementing RandomSearch Technique
# Create an SVM classifier
clf_r = SVC()
# Define the parameter distribution to sample from
param_dist = {
'C': reciprocal(0.1, 10),  # log-uniform over [0.1, 10]; reciprocal(0.1, 1, 10) would pass 10 as loc and shift the range to [10.1, 11]
'kernel': ['linear', 'rbf'],
'gamma': uniform(0.1, 10)
}
# Perform random search to find the best parameters
random_search = RandomizedSearchCV(clf_r, param_distributions=param_dist, n_iter=10, cv=5)
random_search.fit(x_train_bal, y_train_bal)
# Get the best parameters and best score
best_params = random_search.best_params_
best_score = random_search.best_score_
print("Best Parameters:", best_params)
print("Best Score:", best_score)
We have successfully performed hyperparameter tuning for SVM using Randomized Search. The search reported the following best parameters: {'C': 10.190049956545678, 'gamma': 1.5213606054959794, 'kernel': 'linear'} (gamma is ignored when the kernel is linear). Note that the best score originally quoted here (1.5213606054959794) is the gamma value repeated by mistake: a mean cross-validated accuracy cannot exceed 1.0, so the actual best_score_ should be read from the cell's printed output.
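The two distributions used above sample quite differently, which is why the choice matters. A short sketch drawing a few candidates from each (values here are illustrative; the search draws its own):

```python
import numpy as np
from scipy.stats import reciprocal, uniform

# reciprocal(a, b) is a log-uniform distribution on [a, b]: each decade is
# equally likely, the usual choice for a scale parameter like C
c_dist = reciprocal(0.1, 10)
# uniform(loc, scale) is uniform on [loc, loc + scale], here [0.1, 10.1]
g_dist = uniform(0.1, 10)

c_samples = c_dist.rvs(size=5, random_state=0)
g_samples = g_dist.rvs(size=5, random_state=0)
print("C candidates:    ", np.round(c_samples, 3))
print("gamma candidates:", np.round(g_samples, 3))
```

Passing `random_state` to `RandomizedSearchCV` (or to `rvs` as above) makes the sampled candidates, and hence the reported best parameters, reproducible across reruns.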
NOTE: Random Search performed slightly better than Grid Search, therefore we will build the best SVM model using the parameter values provided by Random Search. Let's now finalize the SVM model with the best parameters obtained and compare it with the base model.
# Using best parameters
clf_best = SVC(C=10.190049956545678, gamma=1.5213606054959794, kernel='linear', probability=True)
clf_best.fit(x_train_bal, y_train_bal)
SVC(C=10.190049956545678, gamma=1.5213606054959794, kernel='linear', probability=True)
pred_y = clf_best.predict(x_test_bal)
y_pred_proba = clf_best.predict_proba(x_test_bal)[:, 1]
fpr3, tpr3, thresholds3 = roc_curve(y_test_bal, y_pred_proba)
auc_score3 = auc(fpr3, tpr3)
print("Performance metrics for best parameters: ")
prec_b = precision_score(y_test_bal, pred_y)
print("Precision:", prec_b)
rec_b = recall_score(y_test_bal, pred_y)
print("Recall:", rec_b)
f1_svm_b = f1_score(y_test_bal, pred_y)
print("F1-Score:", f1_svm_b)
roc_auc_b = roc_auc_score(y_test_bal, pred_y)
print("ROC AUC Score:", roc_auc_b)
print("Area under curve:", auc_score3)
acc_b = accuracy_score(y_test_bal, pred_y)
print("Accuracy:", acc_b, '\n')
precision_change = (prec_b - precision_bal) * 100
recall_change = (rec_b - recall_bal) * 100
f1_change = (f1_svm_b - f1_bal) * 100
roc_auc_change = (roc_auc_b - roc_auc_bal) * 100
auc_change = (auc_score3 - auc_score_bal) * 100
accuracy_change = (acc_b - accuracy_bal) * 100
# Print the changes in performance
print("Change in Performance metrics compared to base model after using best parameters: ")
# ':+.2f' prints the sign itself, so a decrease shows as negative rather than '+-'
print(f"Precision: {precision_change:+.2f}%", "Improvement" if precision_change >= 0 else "Decline")
print(f"Recall: {recall_change:+.2f}%", "Improvement" if recall_change >= 0 else "Decline")
print(f"F1-Score: {f1_change:+.2f}%", "Improvement" if f1_change >= 0 else "Decline")
print(f"ROC AUC Score: {roc_auc_change:+.2f}%", "Improvement" if roc_auc_change >= 0 else "Decline")
print(f"Area under curve: {auc_change:+.2f}%", "Improvement" if auc_change >= 0 else "Decline")
print(f"Accuracy: {accuracy_change:+.2f}%", "Improvement" if accuracy_change >= 0 else "Decline", "\n")
print("Confusion Matrix:")
cm=metrics.confusion_matrix(y_test_bal, pred_y, labels=[1, 0])
df_cm = pd.DataFrame(cm, index = [i for i in ["1","0"]],
columns = [i for i in ["Predicted 1","Predicted 0"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True, fmt='g');
print()
Performance metrics for best parameters: 
Precision: 0.8469135802469135
Recall: 0.9050131926121372
F1-Score: 0.875
ROC AUC Score: 0.8689486448235888
Area under curve: 0.922556719539843
Accuracy: 0.8693333333333333 

Change in Performance metrics compared to base model after using best parameters: 
Precision: +0.43% Improvement
Recall: +1.06% Improvement
F1-Score: +0.73% Improvement
ROC AUC Score: +0.71% Improvement
Area under curve: +0.68% Improvement
Accuracy: +0.71% Improvement 

Confusion Matrix:
The SVM model trained on the balanced data with the best combination of parameters outperforms the base model in terms of precision, recall, F1-score, ROC AUC score, and accuracy. This indicates that the SVM model with the optimized hyperparameters is more effective at predicting the target variable on the balanced data.
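Since `fpr2/tpr2` (base SVM) and `fpr3/tpr3` (tuned SVM) are already computed above, the two ROC curves can be overlaid for a visual comparison. The sketch below is self-contained on synthetic stand-in data, because the notebook's arrays live only in the running session; in the notebook itself you would plot `fpr2, tpr2` and `fpr3, tpr3` directly.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs anywhere
import matplotlib.pyplot as plt
from sklearn.datasets import make_classification
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.svm import SVC

# Synthetic stand-in for the balanced train/test split
X, y = make_classification(n_samples=400, n_features=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

plt.figure(figsize=(6, 5))
for name, model in [("base SVM", SVC(probability=True)),
                    ("tuned SVM", SVC(C=1, kernel="linear", probability=True))]:
    proba = model.fit(X_tr, y_tr).predict_proba(X_te)[:, 1]
    fpr, tpr, _ = roc_curve(y_te, proba)
    plt.plot(fpr, tpr, label=f"{name} (AUC = {auc(fpr, tpr):.3f})")
plt.plot([0, 1], [0, 1], "k--", label="chance")
plt.xlabel("False positive rate")
plt.ylabel("True positive rate")
plt.title("ROC comparison (synthetic data)")
plt.legend()
```

A curve hugging the top-left corner, and a larger AUC, indicates the stronger model across all classification thresholds.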
Let's now perform hyperparameter tuning for KNN by implementing GridSearch Technique.
warnings.filterwarnings("ignore")
# Creating the KNN classifier object
clf_knn = KNeighborsClassifier()
# Defining the parameter grid for grid search
param_grid = {
'n_neighbors': list(range(3, 20)), # Values of K from 3 to 19
'weights': ['uniform', 'distance'], # Different weight options
'metric': ['euclidean', 'manhattan', 'chebyshev'] # Different distance metrics (sklearn expects lowercase names)
}
# Perform grid search with cross-validation
grid_search = GridSearchCV(estimator=clf_knn, param_grid=param_grid, cv=10)
grid_search.fit(x_train_bal, y_train_bal)
# Print the best parameter combination
print("Best Parameters: \n", grid_search.best_params_, "\n")
# Evaluate the model with best parameters on the test data
best_model = grid_search.best_estimator_
y_pred = best_model.predict(x_test_bal)
accuracy = accuracy_score(y_test_bal, y_pred)
print("Accuracy:", accuracy)
Best Parameters:
{'metric': 'manhattan', 'n_neighbors': 4, 'weights': 'distance'}
Accuracy: 0.9124444444444444
We have successfully performed hyperparameter tuning for KNN using Grid-Search. The search yielded the following best parameters: {'metric': 'manhattan', 'n_neighbors': 4, 'weights': 'distance'}. The accuracy achieved on the test data with these parameters was 0.9124444444444444 (the search's own best cross-validated score is available separately via grid_search.best_score_). From these results, we can infer that the KNN model with the 'manhattan' metric, K=4, and distance-based weights achieved the highest performance according to the chosen evaluation metric.
Hyperparameter tuning for KNN by implementing Randomized Search Cross-Validation Technique.
# Define the parameter distribution for random search
param_dist = {
'n_neighbors': randint(3, 20), # Randomly sample values between 3 and 20
'weights': ['uniform', 'distance'],
'metric': ['euclidean', 'manhattan','chebyshev']
}
clf_knn = KNeighborsClassifier()
# Perform random search with cross-validation
random_search = RandomizedSearchCV(estimator=clf_knn, param_distributions=param_dist, cv=10, n_iter=10, random_state=10)
random_search.fit(x_train_bal, y_train_bal)
print("Best Parameters:", random_search.best_params_)
# Evaluate the model with best parameters on the test data
best_model = random_search.best_estimator_
y_pred = best_model.predict(x_test_bal)
accuracy = accuracy_score(y_test_bal, y_pred)
print("Accuracy:", accuracy)
Best Parameters: {'metric': 'manhattan', 'n_neighbors': 7, 'weights': 'distance'}
Accuracy: 0.8951111111111111
We have successfully performed hyperparameter tuning for KNN using Randomized Search Cross-Validation. The search yielded the following best parameters: {'metric': 'manhattan', 'n_neighbors': 7, 'weights': 'distance'}, with a test-set accuracy of 0.8951111111111111. From these results, we can infer that the KNN model with the 'manhattan' metric, K=7, and distance-based weights achieved the highest performance according to the chosen evaluation metric.
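It is worth keeping the two numbers produced by a search distinct: `best_score_` is the mean cross-validated accuracy on the training folds, while the accuracy printed above comes from the held-out test set. A minimal sketch on synthetic stand-in data (not the project data):

```python
from sklearn.datasets import make_classification
from sklearn.model_selection import GridSearchCV, train_test_split
from sklearn.neighbors import KNeighborsClassifier

# Synthetic stand-in for the balanced data
X, y = make_classification(n_samples=300, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

search = GridSearchCV(KNeighborsClassifier(), {'n_neighbors': [3, 5, 7]}, cv=5)
search.fit(X_tr, y_tr)

# Two distinct numbers that are easy to conflate in a write-up:
print("best_score_ (mean CV accuracy on training folds):", search.best_score_)
print("accuracy on the held-out test set:               ", search.score(X_te, y_te))
```

Comparing searches by their test-set accuracy, as done above, is fine for a final report, but selecting between searches should ideally rely on the cross-validated scores so the test set stays untouched until the end.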
NOTE: Grid-Search performed slightly better than Randomized Search Cross-Validation, therefore we will build the best KNN model using the parameter values provided by Grid-Search. Let's now finalize the KNN model with the best parameters obtained and compare it with the base model.
# Using best parameters
clf_knn_b = KNeighborsClassifier(n_neighbors=4, metric='manhattan', weights='distance')
clf_knn_b.fit(x_train_bal, y_train_bal)
KNeighborsClassifier(metric='manhattan', n_neighbors=4, weights='distance')
pred_y = clf_knn_b.predict(x_test_bal)
print("Performance metrics for best parameters: ")
prec_knn_b = precision_score(y_test_bal, pred_y)
print("Precision:", prec_knn_b)
rec_knn_b = recall_score(y_test_bal, pred_y)
print("Recall:", rec_knn_b)
f1_knn_b = f1_score(y_test_bal, pred_y)
print("F1-Score:", f1_knn_b)
acc_knn_b = accuracy_score(y_test_bal, pred_y)
print("Accuracy:", acc_knn_b, '\n')
precision_change = (prec_knn_b - precision_bal) * 100
recall_change = (rec_knn_b - recall_bal) * 100
f1_change = (f1_knn_b - f1_bal) * 100
accuracy_change = (acc_knn_b - accuracy_bal) * 100
# Print the changes in performance
print("Change in Performance metrics compared to base model after using best parameters: ")
# ':+.2f' prints the sign itself, so a decrease shows as negative rather than '+-'
print(f"Precision: {precision_change:+.2f}%", "Improvement" if precision_change >= 0 else "Decline")
print(f"Recall: {recall_change:+.2f}%", "Improvement" if recall_change >= 0 else "Decline")
print(f"F1-Score: {f1_change:+.2f}%", "Improvement" if f1_change >= 0 else "Decline")
print(f"Accuracy: {accuracy_change:+.2f}%", "Improvement" if accuracy_change >= 0 else "Decline", "\n")
print("Confusion Matrix:")
cm=metrics.confusion_matrix(y_test_bal, pred_y, labels=[1, 0])
df_cm = pd.DataFrame(cm, index = [i for i in ["1","0"]],
columns = [i for i in ["Predicted 1","Predicted 0"]])
plt.figure(figsize = (7,5))
sns.heatmap(df_cm, annot=True, fmt='g', cmap='inferno');
print()
Performance metrics for best parameters: 
Precision: 0.8632148377125193
Recall: 0.9824098504837291
F1-Score: 0.9189633895516248
Accuracy: 0.9124444444444444 

Change in Performance metrics compared to base model after using best parameters: 
Precision: +2.06% Improvement
Recall: +8.80% Improvement
F1-Score: +5.12% Improvement
Accuracy: +5.02% Improvement 

Confusion Matrix:
- Precision: The KNN model's precision is slightly higher than the base model's, so it identifies true positives correctly while producing fewer false positives.
- Recall: Recall is significantly higher for the KNN model, meaning it identifies a much larger proportion of the true positives and has a lower false-negative rate.
- F1-Score: The KNN model's F1-score, which balances precision and recall, is also higher, indicating a better overall trade-off between the two.
- Accuracy: Accuracy is higher as well, so the KNN model makes more correct predictions overall on the balanced dataset.

Overall, the KNN model outperforms the base model (Logistic Regression) on all performance metrics, showing improved precision, recall, F1-score, and accuracy, and making it the more suitable choice for this problem and dataset.
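To close, the headline numbers printed across the cells above can be gathered into one comparison table. The values below are the printed results re-entered by hand and rounded to three decimals (in the live notebook you would instead collect the variables prec_svm, rec_knn_b, etc.):

```python
import pandas as pd

# Rounded metrics transcribed from the cell outputs above
summary = pd.DataFrame(
    {
        "Precision": [0.000, 0.806, 0.847, 0.863],
        "Recall":    [0.000, 0.948, 0.905, 0.982],
        "F1-Score":  [0.000, 0.871, 0.875, 0.919],
        "Accuracy":  [0.495, 0.859, 0.869, 0.912],
    },
    index=["SVM (base)", "KNN (base)", "SVM (tuned)", "KNN (tuned)"],
)
print(summary)
print("\nBest model by F1:", summary["F1-Score"].idxmax())  # KNN (tuned)
```

The table makes the ranking explicit: the tuned KNN model leads on every metric, with the tuned SVM a reasonable second.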